This work covers the classification of semantic segmentation-based unmanned aerial
vehicle (UAV) applications according to the datasets used, together with the data
preprocessing steps needed to optimize and implement the models. The models were
optimized using evaluation metrics and loss functions, because training deep neural
networks (DNNs) essentially amounts to defining a cost function and then optimizing
it. The convolutional neural network (CNN) is a common type of artificial neural
network (ANN) that has found application in numerous tasks, such as image and video
recognition, image classification, recommender systems, financial time series,
medical image analysis, and natural language processing. A CNN is designed to learn
spatial feature hierarchies automatically and adaptively via backpropagation, using
building blocks such as convolution, pooling, and fully connected layers. The
identification results were excellent. The segmentation detected and delineated the
actual components of an image down to the pixel level. Entire image segmentation
masks with instances were created using the new label editor in the label box.
1. INTRODUCTION

Deep neural networks are models whose architecture is made up of layers arranged in stacks. Each
layer learns a useful pattern by successively filtering the input data. Figure 1 depicts a basic 4-layer neural
network for the classification of handwritten digits from the Modified National Institute of Standards and
Technology (MNIST) dataset [1]. The four components of a system for training a neural network are:
i) Layers; ii) Input data; iii) Loss function; and iv) Optimizer.

Neural networks are trained to identify the weight values that specify the transformations to be
performed on the incoming data [2]. Figure 2 depicts the parameterization of the network layers. These
weights are challenging to optimize for varied tasks because there can be millions of interdependent
parameters. A loss function guides the search for a weight configuration because it measures how far the
output produced by the network's layer representations is from the expected output. The first stage of the
backpropagation (BP) algorithm is the determination of the final loss value [3], followed by computation of
each parameter's contribution to that loss value, working from the top layer to the bottom layer [4]. The
derivative of the loss function at a given training point indicates the adjustments required to minimize the
loss and optimize the model with respect to a given evaluation metric; updating the weights in this way is
known as stochastic gradient descent [5], [6].

Journal homepage: http://ijai.iaescore.com
ISSN: 2252-8938

Figure 3 shows the final representation of a neural network training loop. In this loop, the weights
are randomly initialized (using a random initializer), which results in a high loss score because the initial
weights are unlikely to capture the data patterns in a usable manner. With each training batch, however, the
network adjusts its weights and gradually improves until it reaches a small loss [7].
Figure 3. Training loop for a neural network
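
The training loop can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the training code used in this work: it fits a one-weight linear model with a squared-error loss and per-sample stochastic gradient descent. The toy data, the learning rate, and the names `train` and `history` are hypothetical, and plain Python is used instead of a deep learning library to keep the sketch self-contained.

```python
import random

def train(samples, epochs=200, lr=0.05, seed=0):
    """Fit y = w*x + b by per-sample stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    # Random initialization: the loss is expected to be high at first.
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    history = []
    for _ in range(epochs):
        rng.shuffle(samples)
        total = 0.0
        for x, y in samples:
            pred = w * x + b        # forward pass
            err = pred - y
            total += err * err      # squared-error loss for this sample
            # Gradients of err**2 with respect to w and b (chain rule),
            # followed by a small step against the gradient.
            w -= lr * 2 * err * x
            b -= lr * 2 * err
        history.append(total / len(samples))
    return w, b, history

# Hypothetical toy data generated from y = 2x + 1.
data = [(k / 10, 2 * (k / 10) + 1) for k in range(-10, 11)]
w, b, history = train(list(data))
print(f"w={w:.3f}, b={b:.3f}, first loss={history[0]:.4f}, last loss={history[-1]:.6f}")
```

The mean loss recorded per epoch in `history` falls from its initial value toward zero as the weights approach the values that generated the data, which mirrors the behavior described for the loop above.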


Int J Artif Intell, Vol. 12, No. 2, June 2023: 641-647 



2. METHOD 
2.1. Dense layers 

Traditionally, densely connected layers, also known as fully connected layers [8], [9], in which
every neuron in one layer is connected to every neuron in the next, have been used for datasets other than
images. Dense layers [10], [11] can only learn global patterns that span their entire input feature space,
whereas each convolution filter learns local patterns within a small window (kernel space or region of
interest) [12]-[14]. Convolution therefore makes it possible to break input images down into edges and
textures that are easy to learn, and such local patterns are more useful for classification than global
patterns [15].


2.2. Convolutional neural networks (CNNs) 

CNNs are a type of neural network widely used in deep learning (DL) for computer vision tasks
[16]-[18]. CNNs are similar to traditional neural networks, except that instead of generic matrix
multiplication [19], [20], they rely on a convolution kernel in one or more of the network layers [21], [22].
CNNs take image tensors as input, extract the image features relevant for classification or differentiation,
and then output classifications [23], [24]. CNNs have found wide application because their filters can learn
translation-invariant patterns and spatial hierarchies. Learning translation-invariant patterns means that,
after learning a pattern in one region of the image, the network can identify it anywhere else in the image.
This makes image processing efficient, because only a few training samples are needed to learn
representations that generalize [25].

Spatial hierarchies describe how the network's initial layers learn small local patterns, which
subsequent layers combine into larger patterns. This enhances the network's ability to learn abstract and
complex visual concepts. It also allows these networks to produce more accurate predictions than simply
vectorizing a complicated image and relying heavily on individual pixels. Because they capture these key
properties and the interactions between features, CNNs are well suited to image classification [26].


2.3. Convolution 

Convolution is an operation that combines two functions of a real-valued argument to create a new
function. Let x(a) represent the time-based input position function, w(a) the weighting function that
prioritizes recent measurements, and s(t) the time-based output estimate function. Based on these
definitions, the generic equation for convolution is shown in (1), and Figure 4 shows a visual
representation of dimensional convolution [27].


$$s(t) = \int x(a)\, w(t - a)\, da \tag{1}$$
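
For sampled data, the integral in (1) becomes a sum over discrete positions. The sketch below is a plain-Python illustration of that discrete form under the assumption that both sequences are zero outside their listed values; the function name `convolve_1d` and the sample readings are hypothetical.

```python
def convolve_1d(x, w):
    """Discrete analogue of Eq. (1): s[t] = sum over a of x[a] * w[t - a]."""
    s = [0.0] * (len(x) + len(w) - 1)
    for t in range(len(s)):
        for a in range(len(x)):
            # Only terms where the kernel index t - a is valid contribute.
            if 0 <= t - a < len(w):
                s[t] += x[a] * w[t - a]
    return s

# Hypothetical position readings and weights that favor recent measurements.
x = [1.0, 2.0, 3.0, 4.0]
w = [0.5, 0.3, 0.2]   # w[0] weights the most recent measurement
s = convolve_1d(x, w)
print(s)
```

Each output value is a weighted average of the current reading and the two before it, so `s` plays the role of the smoothed estimate s(t).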
Figure 4. Visual representation of dimensional convolution


In CNNs, x(a) is the input function, w(a) is the kernel, and s(t) is the feature map. Convolutions
operate on 3-D tensors made up of the input images' height, width, and color channels [28]. These
operations take patches from the input and apply the same transformation to each of them, producing an
output feature map whose depth may differ from that of the input. The output feature map depth is the
number of filters computed by the layer, each of which can encode a different feature of the input data.
This depth and the size of the extracted patches are the two key parameters that define a convolution [29].

A 2-D convolution is computed by sliding a window of defined width and height across every
location of the input feature map. For images, the input feature map has three dimensions: height, width,
and color channels (corresponding to red, green, and blue). At each location, the patch of surrounding
features under the window is extracted. Each patch is then transformed by the convolution kernel into a
1-D vector whose length is the output depth. Finally, the vectors are spatially reassembled into an output
map in which every location corresponds to a location of the input map. Figure 4 depicts the process of
this convolution [30].
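
The patch-extraction procedure can be written out directly. The following is a minimal sketch using nested Python lists rather than a tensor library; it assumes square kernels, no padding, and the sliding-window (cross-correlation) form that deep learning libraries commonly use for convolution layers. The function name `conv2d`, the toy image, and the two filters are hypothetical.

```python
def conv2d(image, kernels, stride=1):
    """Unpadded 2-D convolution over a [height][width][channels] nested list.

    kernels has shape [out_depth][k][k][channels];
    the result has shape [out_h][out_w][out_depth].
    """
    height, width = len(image), len(image[0])
    k = len(kernels[0])
    out_h = (height - k) // stride + 1
    out_w = (width - k) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            top, left = i * stride, j * stride
            # Extract the k x k x channels patch under the window.
            patch = [image[top + di][left + dj] for di in range(k) for dj in range(k)]
            # Each filter reduces the patch to one number; together these
            # numbers form a 1-D vector whose length is the output depth.
            vector = []
            for kern in kernels:
                flat = [kern[di][dj] for di in range(k) for dj in range(k)]
                vector.append(sum(p * q for pix, wts in zip(patch, flat)
                                  for p, q in zip(pix, wts)))
            row.append(vector)
        output.append(row)
    return output

# Hypothetical 4 x 4 single-channel image and two 2 x 2 filters.
image = [[[r * 4 + c] for c in range(4)] for r in range(4)]
kernels = [
    [[[1], [1]], [[1], [1]]],     # sums the window
    [[[-1], [1]], [[-1], [1]]],   # horizontal difference
]
full = conv2d(image, kernels)
strided = conv2d(image, kernels, stride=2)
print(len(full), len(full[0]), len(full[0][0]))
```

The output depth equals the number of filters (two here), regardless of the number of input channels.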


2.4. Strides 

The stride is a parameter of the convolution operation that determines the size of the output in
CNNs. It specifies the distance between successive patches extracted from the input feature map. With a
stride of 2 and no padding, the width and height of the output feature map are downsized by a factor of
about 2, as shown in Figure 5 [31].
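
The effect of the stride on the output size follows from the standard output-size formula for convolutions. The helper below is an illustrative sketch; the name `conv_output_size` and the `padding` argument are not from the source.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

# A 6-pixel input with a 2-pixel kernel and no padding.
print(conv_output_size(6, 2, stride=1))  # 5
print(conv_output_size(6, 2, stride=2))  # 3
```

With these sizes, moving from a stride of 1 to a stride of 2 shrinks each spatial dimension from 5 to 3, which is roughly the factor of 2 described above.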


Figure 5. Strides (labels in the figure: 6 × 6 input, 2 × 2 kernel, 5 × 5 output)


2.5. Semantic segmentation 

Semantic segmentation is an image classification approach in which each pixel in an image is
assigned a class label. For aerial imagery, semantic segmentation is used within a feature extractor to
locate pertinent areas of an image that are likely to remain invariant across seasons and times of day. Even
though fully convolutional networks (FCNs) can be used for this task, learning precise segmentation and
localization requires a large amount of data. Because manually labeling satellite images with many classes
is time-consuming, few publicly available datasets have extensive pixel-by-pixel labels. Hence, FCNs and
U-Nets are commonly employed in aerial imagery segmentation to address both issues [32].
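
Per-pixel labeling can be illustrated with a small sketch. It assumes the network has already produced one score per class for every pixel, and it uses intersection over union (IoU) as one common segmentation metric; the source does not state which evaluation metrics were used, so IoU, the function names, and the toy values are assumptions.

```python
def predict_labels(scores):
    """Pick the highest-scoring class for every pixel.

    scores is a [height][width][num_classes] nested list;
    the result is a [height][width] grid of class indices.
    """
    return [[max(range(len(px)), key=px.__getitem__) for px in row] for row in scores]

def iou(pred, target, cls):
    """Intersection over union for one class across all pixels."""
    inter = union = 0
    for prow, trow in zip(pred, target):
        for p, t in zip(prow, trow):
            inter += (p == cls and t == cls)
            union += (p == cls or t == cls)
    return inter / union if union else float("nan")

# Hypothetical 2 x 2 image with two classes.
scores = [[[0.9, 0.1], [0.2, 0.8]],
          [[0.6, 0.4], [0.3, 0.7]]]
target = [[0, 1],
          [1, 1]]
pred = predict_labels(scores)
print(pred, iou(pred, target, 0), iou(pred, target, 1))
```

Here one of the four pixels is mislabeled, which lowers the IoU of both classes.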


2.6. Fully convolutional networks (FCN) 

The advent of a variant of the CNN known as the FCN represents a major advance in image
segmentation. The difference between the FCN and the traditional CNN is that, in the FCN, the fully
connected terminal layers are transformed into convolution layers, so that the network computes a
nonlinear filter for each layer's output vectors. As a result, the network can operate on inputs of any size
and produce outputs with corresponding spatial dimensions. Hence, the classification network can generate
a heatmap of the selected item class. Adding layers and a spatial loss to the network yields an efficient
scheme for end-to-end dense learning. Figure 6 depicts an example of this transformation [33].
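
The idea behind this transformation can be shown in its simplest case, where a fully connected classifier is applied at every spatial location as a 1 × 1 convolution. This is a hedged sketch rather than the FCN architecture itself: in a real network the converted kernels cover a larger input region, and the names `dense` and `dense_as_conv1x1`, the weights, and the feature map are hypothetical.

```python
def dense(features, weights, bias):
    """Fully connected layer: one score per class from a feature vector."""
    return [sum(f * w for f, w in zip(features, row)) + b
            for row, b in zip(weights, bias)]

def dense_as_conv1x1(feature_map, weights, bias):
    """Apply the same dense weights at every spatial location (a 1 x 1 convolution)."""
    return [[dense(px, weights, bias) for px in row] for row in feature_map]

# Hypothetical classifier with 3 input features and 2 classes.
weights = [[1, 0, -1],
           [0, 2, 1]]
bias = [0, 1]

# A 2 x 3 feature map with 3 features per location; any spatial size works.
feature_map = [[[1, 2, 3], [0, 1, 0], [2, 2, 2]],
               [[3, 0, 1], [1, 1, 1], [0, 0, 4]]]
heatmap = dense_as_conv1x1(feature_map, weights, bias)
print(heatmap[0][0], heatmap[1][2])
```

Because the same weights are reused at each location, the output is a grid of class scores (a heatmap) whose spatial size follows that of the input.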

FCNs are not only more flexible, because they accept a variety of input image sizes, but have also
been shown to be more efficient at learning dense predictions thanks to in-network upsampling. The FCN
also retains the input's spatial information, which is important for semantic segmentation because the task
requires both classification and localization. Although an FCN can accept an input image of any size, its
output resolution is reduced by the non-padded convolutions in the network. These were introduced to keep
filter sizes modest and reduce the computational demands. The outcome is a coarse output, reduced in size
by a factor equal to the pixel stride of the receptive field of the output units [34], [35].
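
To relate a coarse output back to the input grid, it has to be upsampled. The sketch below uses nearest-neighbor repetition as a deliberately simple stand-in; it is not the in-network upsampling of the FCN, and the function name `upsample_nearest` and the toy label map are hypothetical.

```python
def upsample_nearest(coarse, factor):
    """Repeat every value `factor` times along both spatial axes."""
    out = []
    for row in coarse:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out

# Hypothetical 2 x 2 coarse label map upsampled by a factor of 2.
coarse = [[0, 1],
          [2, 3]]
dense_map = upsample_nearest(coarse, 2)
for row in dense_map:
    print(row)
```

Each coarse value covers a 2 × 2 block of the upsampled map, which shows why a larger reduction factor leads to blockier segmentation boundaries.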




Figure 6. FCN making dense predictions for per-pixel tasks such as semantic segmentation (labels in the
figure: forward/inference, backward/learning)


3. RESULTS AND DISCUSSION 

UAV systems benefit from applying this model, especially given their limited onboard resources.
Fully autonomous UAVs need to understand their surrounding environments in detail. The surrounding
objects are always three-dimensional, whereas the captured images are two-dimensional, which is critical
for high-level decision-making. For example, drones must understand their surroundings in real time;
real-time semantic mapping is therefore significant and worth exploring in this type of drone application.
Such processing also bears on the most important constraint of a UAV, its power supply: semantic
segmentation can reduce the time and power needed to process video images or frames in order to reach a
final decision [36].


4. CONCLUSION 

The FCN has been shown to be not only more flexible, because it accepts input images of different
sizes, but also more efficient, because the network uses upsampling to learn dense predictions. The FCN
also preserves the input's spatial information, which matters for semantic segmentation because this task
requires both classification and localization. Any image size can be used as input to an FCN, even though
the output resolution is reduced by convolution without padding. These convolutions were introduced as a
way of reducing the computational demands and filter sizes. The result is a coarse output that is reduced in
size by a factor equal to the pixel stride of the receptive field of the output units.